Regularization


Roadmap
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data

Lecture 14: Regularization
• Regularized Hypothesis Set
• Weight Decay Regularization
• Regularization and VC Theory
• General Regularizers
Regularization Regularized Hypothesis Set


Regularization: The Magic

[Figure: 'regularized fit' ⇐ overfit]

• idea: 'step back' from $\mathcal{H}_{10}$ to $\mathcal{H}_2$
• name history: function approximation for ill-posed problems

how to step back?


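Stepping Back as Constraint

Q-th order polynomial transform for $x \in \mathbb{R}$:
$\Phi_Q(x) = (1, x, x^2, \ldots, x^Q)$
+ linear regression, denote $\tilde{\mathbf{w}}$ by $\mathbf{w}$

hypothesis $\mathbf{w}$ in $\mathcal{H}_{10}$: $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \ldots + w_{10} x^{10}$
hypothesis $\mathbf{w}$ in $\mathcal{H}_2$: $w_0 + w_1 x + w_2 x^2$

that is, $\mathcal{H}_2 = \mathcal{H}_{10}$ AND 'constraint that $w_3 = w_4 = \ldots = w_{10} = 0$'

step back = constraint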




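Regression with Constraint

$\mathcal{H}_{10} \equiv \{\mathbf{w} \in \mathbb{R}^{10+1}\}$
regression with $\mathcal{H}_{10}$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$

$\mathcal{H}_2 \equiv \{\mathbf{w} \in \mathbb{R}^{10+1} \text{ while } w_3 = w_4 = \ldots = w_{10} = 0\}$
regression with $\mathcal{H}_2$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $w_3 = w_4 = \ldots = w_{10} = 0$

step back = constrained optimization of $E_{\mathrm{in}}$
why don't you just use $\mathbf{w} \in \mathbb{R}^{2+1}$? :-)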
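The closing question can be made concrete with a minimal sketch. It assumes NumPy and toy data; the function names `poly_transform`, `fit_h10_constrained` and `fit_h2` are hypothetical. Under those assumptions, the constrained problem over 11 weights returns the same first three weights as plain least squares over 3 weights.

```python
import numpy as np

def poly_transform(x, Q):
    """Q-th order polynomial transform: each row is (1, x, x^2, ..., x^Q)."""
    return np.vander(x, Q + 1, increasing=True)

def fit_h10_constrained(x, y):
    """Minimize E_in over w in R^(10+1) subject to w_3 = ... = w_10 = 0."""
    Z = poly_transform(x, 10)
    w = np.zeros(11)
    # With w_3..w_10 pinned to zero, only the first three columns of Z matter.
    coef, *_ = np.linalg.lstsq(Z[:, :3], y, rcond=None)
    w[:3] = coef
    return w

def fit_h2(x, y):
    """Unconstrained least squares over w in R^(2+1)."""
    coef, *_ = np.linalg.lstsq(poly_transform(x, 2), y, rcond=None)
    return coef

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)

w_constrained = fit_h10_constrained(x, y)
w_h2 = fit_h2(x, y)
```

The constrained view looks redundant here, but it is the form that the looser and softer constraints on the following slides generalize.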







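Regression with Looser Constraint

$\mathcal{H}_2 \equiv \{\mathbf{w} \in \mathbb{R}^{10+1} \text{ while } w_3 = \ldots = w_{10} = 0\}$
regression with $\mathcal{H}_2$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $w_3 = \ldots = w_{10} = 0$

$\mathcal{H}_2' \equiv \{\mathbf{w} \in \mathbb{R}^{10+1} \text{ while } \ge 8 \text{ of } w_q = 0\}$
regression with $\mathcal{H}_2'$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $\sum_{q=0}^{10} [\![ w_q \neq 0 ]\!] \le 3$

• more flexible than $\mathcal{H}_2$: $\mathcal{H}_2 \subset \mathcal{H}_2'$
• less risky than $\mathcal{H}_{10}$: $\mathcal{H}_2' \subset \mathcal{H}_{10}$

bad news for sparse hypothesis set $\mathcal{H}_2'$:
NP-hard to solve :-(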
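To get a feel for why the sparse constraint is expensive, here is a brute-force sketch. It assumes NumPy and toy data, and `sparse_regression_bruteforce` is a hypothetical helper. It tries every set of at most 3 nonzero weights, which is affordable for 11 weights but grows combinatorially with the number of weights; it illustrates the search space and is not a proof of NP-hardness.

```python
import itertools
import numpy as np

def sparse_regression_bruteforce(Z, y, max_nonzero=3):
    """Search every support with at most `max_nonzero` nonzero weights.

    Returns the best weight vector, its E_in, and the number of supports tried.
    """
    n_features = Z.shape[1]
    best_w = np.zeros(n_features)
    best_err = float(np.mean(y ** 2))   # E_in of the all-zero weight vector
    n_supports = 1                      # the empty support, counted above
    for k in range(1, max_nonzero + 1):
        for support in itertools.combinations(range(n_features), k):
            cols = list(support)
            coef, *_ = np.linalg.lstsq(Z[:, cols], y, rcond=None)
            err = float(np.mean((Z[:, cols] @ coef - y) ** 2))
            n_supports += 1
            if err < best_err:
                best_err = err
                best_w = np.zeros(n_features)
                best_w[cols] = coef
    return best_w, best_err, n_supports

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)
Z = np.vander(x, 11, increasing=True)   # 10th-order polynomial transform

w_sparse, err_sparse, n_supports = sparse_regression_bruteforce(Z, y)

# H_2 corresponds to the single support {0, 1, 2}, one of the supports searched.
coef_h2, *_ = np.linalg.lstsq(Z[:, :3], y, rcond=None)
err_h2 = float(np.mean((Z[:, :3] @ coef_h2 - y) ** 2))
```

Because the support {0, 1, 2} is among those searched, the sparse fit can never have a larger in-sample error than the fit with the first three weights only, which is the 'more flexible' bullet in numbers.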


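Regression with Softer Constraint

$\mathcal{H}_2' \equiv \{\mathbf{w} \in \mathbb{R}^{10+1} \text{ while } \ge 8 \text{ of } w_q = 0\}$
regression with $\mathcal{H}_2'$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $\sum_{q=0}^{10} [\![ w_q \neq 0 ]\!] \le 3$

$\mathcal{H}(C) \equiv \{\mathbf{w} \in \mathbb{R}^{10+1} \text{ while } \|\mathbf{w}\|^2 \le C\}$
regression with $\mathcal{H}(C)$:
$\min_{\mathbf{w} \in \mathbb{R}^{10+1}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $\sum_{q=0}^{10} w_q^2 \le C$

• $\mathcal{H}(C)$: overlaps but not exactly the same as $\mathcal{H}_2'$
• soft and smooth structure over $C \ge 0$:
$\mathcal{H}(0) \subset \mathcal{H}(1.126) \subset \ldots \subset \mathcal{H}(1126) \subset \ldots \subset \mathcal{H}(\infty)$

regularized hypothesis $\mathbf{w}_{\mathrm{REG}}$:
optimal solution from regularized hypothesis set $\mathcal{H}(C)$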


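Fun Time

For $Q \ge 1$, which of the following hypotheses (weight vector $\mathbf{w} \in \mathbb{R}^{Q+1}$) is not in the regularized hypothesis set $\mathcal{H}(1)$?

(1) $\mathbf{w}^T = [0, 0, \ldots, 0]$
(2) $\mathbf{w}^T = [1, 0, \ldots, 0]$
(3) $\mathbf{w}^T = [1, 1, \ldots, 1]$
(4) $\mathbf{w}^T = \left[\sqrt{\tfrac{1}{Q+1}}, \sqrt{\tfrac{1}{Q+1}}, \ldots, \sqrt{\tfrac{1}{Q+1}}\right]$

Reference Answer: (3)

The squared length of $\mathbf{w}$ in (3) is $Q + 1$, which is not $\le 1$.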
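The reference answer can be checked with a small membership test. This is a sketch assuming NumPy; `in_regularized_set` is a hypothetical helper and the value of `Q` is an arbitrary choice. Only the first three options are checked.

```python
import numpy as np

def in_regularized_set(w, C):
    """True if w lies in H(C), i.e. its squared length w^T w is at most C."""
    w = np.asarray(w, dtype=float)
    return float(w @ w) <= C

Q = 3                              # any Q >= 1 works; 3 is arbitrary
w_zero = np.zeros(Q + 1)           # option (1): [0, 0, ..., 0]
w_unit = np.eye(Q + 1)[0]          # option (2): [1, 0, ..., 0]
w_ones = np.ones(Q + 1)            # option (3): [1, 1, ..., 1]

membership = {
    1: in_regularized_set(w_zero, C=1.0),
    2: in_regularized_set(w_unit, C=1.0),
    3: in_regularized_set(w_ones, C=1.0),
}
```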
Regularization Weight Decay Regularization


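Matrix Form of Regularized Regression Problem

$\min_{\mathbf{w} \in \mathbb{R}^{Q+1}} E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{z}_n - y_n)^2 = \frac{1}{N} (\mathrm{Z}\mathbf{w} - \mathbf{y})^T (\mathrm{Z}\mathbf{w} - \mathbf{y})$
s.t. $\sum_{q=0}^{Q} w_q^2 = \mathbf{w}^T \mathbf{w} \le C$

• $\sum_{n=1}^{N} (\mathbf{w}^T \mathbf{z}_n - y_n)^2 = (\mathrm{Z}\mathbf{w} - \mathbf{y})^T (\mathrm{Z}\mathbf{w} - \mathbf{y})$, remember? :-)
• $\mathbf{w}^T \mathbf{w} \le C$: feasible $\mathbf{w}$ within a radius-$\sqrt{C}$ hypersphere

how to solve
constrained optimization problem?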
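Before tackling the constraint, the sum form and the matrix form of the in-sample error can be compared numerically. A minimal sketch, assuming NumPy and random toy data; `e_in_sum` and `e_in_matrix` are hypothetical names.

```python
import numpy as np

def e_in_sum(w, Z, y):
    """E_in as an explicit sum over examples: (1/N) sum_n (w^T z_n - y_n)^2."""
    return sum((w @ z_n - y_n) ** 2 for z_n, y_n in zip(Z, y)) / len(y)

def e_in_matrix(w, Z, y):
    """E_in in matrix form: (1/N) (Zw - y)^T (Zw - y)."""
    r = Z @ w - y
    return float(r @ r) / len(y)

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
w = rng.standard_normal(4)

sum_form = e_in_sum(w, Z, y)
matrix_form = e_in_matrix(w, Z, y)
```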







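The Lagrange Multiplier

$\min_{\mathbf{w} \in \mathbb{R}^{Q+1}} E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N} (\mathrm{Z}\mathbf{w} - \mathbf{y})^T (\mathrm{Z}\mathbf{w} - \mathbf{y})$ s.t. $\mathbf{w}^T \mathbf{w} \le C$

[Figure: contour $E_{\mathrm{in}} = \text{const.}$]

• decreasing direction: $-\nabla E_{\mathrm{in}}(\mathbf{w})$, remember? :-)
• normal vector of $\mathbf{w}^T \mathbf{w} = C$: $\mathbf{w}$
• if $-\nabla E_{\mathrm{in}}(\mathbf{w})$ and $\mathbf{w}$ not parallel: can decrease $E_{\mathrm{in}}(\mathbf{w})$ without violating the constraint
• at optimal solution $\mathbf{w}_{\mathrm{REG}}$, $-\nabla E_{\mathrm{in}}(\mathbf{w}_{\mathrm{REG}}) \propto \mathbf{w}_{\mathrm{REG}}$

want: find Lagrange multiplier $\lambda > 0$ and $\mathbf{w}_{\mathrm{REG}}$
such that $\nabla E_{\mathrm{in}}(\mathbf{w}_{\mathrm{REG}}) + \frac{2\lambda}{N} \mathbf{w}_{\mathrm{REG}} = \mathbf{0}$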
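The parallel condition can be checked numerically. The sketch below assumes NumPy, toy data and an arbitrary value of `lam`; it obtains the regularized weights from the closed form $(\mathrm{Z}^T\mathrm{Z} + \lambda \mathrm{I})^{-1}\mathrm{Z}^T\mathbf{y}$ that this lecture derives under Augmented Error, then tests stationarity and the angle between $-\nabla E_{\mathrm{in}}$ and the weights. `grad_e_in` is a hypothetical helper.

```python
import numpy as np

def grad_e_in(w, Z, y):
    """Gradient of E_in(w) = (1/N) (Zw - y)^T (Zw - y)."""
    return 2.0 / len(y) * (Z.T @ (Z @ w - y))

# Toy data and lambda (assumptions for illustration, not from the lecture).
rng = np.random.default_rng(0)
N, d = 30, 4
Z = rng.standard_normal((N, d))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)
lam = 2.0

w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
g = grad_e_in(w_reg, Z, y)

# Left-hand side of the Lagrange condition; should vanish at w_reg.
stationarity = g + (2.0 * lam / N) * w_reg

# Cosine of the angle between -grad E_in and w_reg; 1 means parallel.
cosine = float((-g) @ w_reg) / (np.linalg.norm(g) * np.linalg.norm(w_reg))

# The constraint level that this lambda corresponds to.
C_equivalent = float(w_reg @ w_reg)
```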


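Augmented Error

• if oracle tells you $\lambda > 0$, then solving
$\nabla E_{\mathrm{in}}(\mathbf{w}_{\mathrm{REG}}) + \frac{2\lambda}{N} \mathbf{w}_{\mathrm{REG}} = \mathbf{0}$
$\frac{2}{N} \left( \mathrm{Z}^T \mathrm{Z} \mathbf{w}_{\mathrm{REG}} - \mathrm{Z}^T \mathbf{y} \right) + \frac{2\lambda}{N} \mathbf{w}_{\mathrm{REG}} = \mathbf{0}$

• optimal solution:
$\mathbf{w}_{\mathrm{REG}} \leftarrow (\mathrm{Z}^T \mathrm{Z} + \lambda \mathrm{I})^{-1} \mathrm{Z}^T \mathbf{y}$
(called ridge regression in Statistics)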
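The closed form translates almost directly into code. A minimal sketch, assuming NumPy and toy data; `ridge_regression` is a hypothetical name, and `np.linalg.solve` is used instead of forming the inverse explicitly, which is a numerical choice rather than something the slides prescribe.

```python
import numpy as np

def ridge_regression(Z, y, lam):
    """Return w_REG = (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    # Solve the linear system rather than inverting the matrix.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
N, d = 30, 4
Z = rng.standard_normal((N, d))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)

w_reg = ridge_regression(Z, y, lam=1.0)
w_no_reg = ridge_regression(Z, y, lam=0.0)   # reduces to plain linear regression
```

With `lam=0.0` the system is the ordinary normal equation, so the result matches linear regression whenever $\mathrm{Z}^T\mathrm{Z}$ is invertible.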









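Augmented Error

• if oracle tells you $\lambda > 0$, then solving $\nabla E_{\mathrm{in}}(\mathbf{w}_{\mathrm{REG}}) + \frac{2\lambda}{N} \mathbf{w}_{\mathrm{REG}} = \mathbf{0}$ is equivalent to minimizing
$E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$
with regularizer $\mathbf{w}^T \mathbf{w}$ and augmented error $E_{\mathrm{aug}}(\mathbf{w})$

• regularization with augmented error instead of constrained $E_{\mathrm{in}}$:
$\mathbf{w}_{\mathrm{REG}} \leftarrow \arg\min_{\mathbf{w}} E_{\mathrm{aug}}(\mathbf{w})$ for given $\lambda > 0$ or $\lambda = 0$

minimizing unconstrained $E_{\mathrm{aug}}$ effectively
minimizes some C-constrained $E_{\mathrm{in}}$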
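The trade-off inside the augmented error can be seen on toy data. The sketch below assumes NumPy and an arbitrary `lam`; `e_in` and `e_aug` are hypothetical helpers. It compares the regularized weights with the unregularized ones and with random nearby points.

```python
import numpy as np

def e_in(w, Z, y):
    """E_in(w) = (1/N) (Zw - y)^T (Zw - y)."""
    r = Z @ w - y
    return float(r @ r) / len(y)

def e_aug(w, Z, y, lam):
    """Augmented error E_aug(w) = E_in(w) + (lam/N) w^T w."""
    return e_in(w, Z, y) + lam / len(y) * float(w @ w)

# Toy data and lambda (assumptions for illustration, not from the lecture).
rng = np.random.default_rng(0)
N, d = 30, 4
Z = rng.standard_normal((N, d))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)
lam = 1.0

w_lin, *_ = np.linalg.lstsq(Z, y, rcond=None)
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# E_aug at random nearby points minus E_aug at w_reg; each gap should be positive.
gaps = [
    e_aug(w_reg + 0.1 * rng.standard_normal(d), Z, y, lam) - e_aug(w_reg, Z, y, lam)
    for _ in range(100)
]
```

The regularized weights give up a little in-sample error relative to `w_lin` in exchange for a shorter weight vector, and they come out ahead once both terms are added.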











The Results

[Figure: fits (Data, Target) for increasing $\lambda$, including $\lambda = 0.0001$ and $\lambda = 0.01$, going from overfitting to underfitting]

philosophy: a little regularization goes a long way!

call '$+\frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$' weight-decay regularization:
larger $\lambda$
⟺ prefer shorter $\mathbf{w}$
⟺ effectively smaller $C$
(go with 'any' transform + linear model)
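The move from overfitting towards underfitting can be sketched by sweeping the regularization parameter on a 10th-order polynomial fit. Everything here is an assumption for illustration: the toy target, the noise level, the list `lams` and the helper names. The fit is computed as least squares on a stacked system, which has the same minimizer as the closed-form solution and also covers the unregularized case.

```python
import numpy as np

def fit_weight_decay(Z, y, lam):
    """Minimize (1/N)||Zw - y||^2 + (lam/N)||w||^2.

    Solved as ordinary least squares on a stacked system, which has the same
    minimizer as (Z^T Z + lam I)^{-1} Z^T y and also handles lam = 0.
    """
    d = Z.shape[1]
    Z_stack = np.vstack([Z, np.sqrt(lam) * np.eye(d)])
    y_stack = np.concatenate([y, np.zeros(d)])
    w, *_ = np.linalg.lstsq(Z_stack, y_stack, rcond=None)
    return w

def e_in(w, Z, y):
    """In-sample squared error (1/N)||Zw - y||^2."""
    r = Z @ w - y
    return float(r @ r) / len(y)

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)
Z = np.vander(x, 11, increasing=True)   # 10th-order polynomial transform

lams = [0.0, 1e-4, 1e-2, 1.0]
errors = [e_in(fit_weight_decay(Z, y, lam), Z, y) for lam in lams]
```

In-sample error can only go up as the penalty grows; whether out-of-sample error improves depends on the data and is not measured in this sketch.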


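Some Detail: Legendre Polynomials

$\min_{\mathbf{w} \in \mathbb{R}^{Q+1}} \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^T \Phi(x_n) - y_n)^2 + \frac{\lambda}{N} \sum_{q=0}^{Q} w_q^2$

naive polynomial transform:
$\Phi(x) = (1, x, x^2, \ldots, x^Q)$
(when $x_n \in [-1, +1]$, $x_n^q$ really small, needing large $w_q$)

normalized polynomial transform:
$(1, L_1(x), L_2(x), \ldots, L_Q(x))$
('orthonormal basis functions' called Legendre polynomials)

[Figure: Legendre polynomials, including $L_1(x) = x$, $L_2(x) = \frac{1}{2}(3x^2 - 1)$, $L_3(x) = \frac{1}{2}(5x^3 - 3x)$, $L_4(x) = \frac{1}{8}(35x^4 - 30x^2 + 3)$]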
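Both transforms are available in NumPy, so the difference is easy to inspect. A sketch with hypothetical helper names; note that `numpy.polynomial.legendre.legvander` returns the standard Legendre polynomials, which are orthogonal on $[-1, +1]$ but not scaled to unit norm, so any extra normalization implied by 'orthonormal' is not applied here.

```python
import numpy as np
from numpy.polynomial import legendre

def naive_transform(x, Q):
    """Rows are (1, x, x^2, ..., x^Q)."""
    return np.vander(x, Q + 1, increasing=True)

def legendre_transform(x, Q):
    """Rows are (L_0(x), L_1(x), ..., L_Q(x)) with L_0 = 1."""
    return legendre.legvander(x, Q)

x = np.linspace(-1.0, 1.0, 101)
Z_naive = naive_transform(x, 10)
Z_leg = legendre_transform(x, 10)

# Average magnitude of the highest-order feature under each transform.
mean_abs_naive = float(np.mean(np.abs(Z_naive[:, 10])))
mean_abs_legendre = float(np.mean(np.abs(Z_leg[:, 10])))
```

On this grid the naive 10th-order feature is small over most of the interval, which is why a fit on the naive basis tends to need large high-order weights.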


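Fun Time

When would $\mathbf{w}_{\mathrm{REG}}$ equal $\mathbf{w}_{\mathrm{LIN}}$?

(1) $\lambda = 0$
(2) $C = \infty$
(3) $C \ge \|\mathbf{w}_{\mathrm{LIN}}\|^2$
(4) all of the above

Reference Answer: (4)

(1) and (2) shall be easy; (3) means that there is effectively no constraint on $\mathbf{w}$, hence the equivalence.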
Regularization Regularization and VC Theory


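Regularization and VC Theory

Regularization by Constrained-Minimizing $E_{\mathrm{in}}$:
$\min_{\mathbf{w}} E_{\mathrm{in}}(\mathbf{w})$ s.t. $\mathbf{w}^T \mathbf{w} \le C$

VC Guarantee of Constrained-Minimizing $E_{\mathrm{in}}$:
$E_{\mathrm{out}}(\mathbf{w}) \le E_{\mathrm{in}}(\mathbf{w}) + \Omega(\mathcal{H}(C))$

Regularization by Minimizing $E_{\mathrm{aug}}$ ($C$ equivalent to some $\lambda$):
$\min_{\mathbf{w}} E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$

minimizing $E_{\mathrm{aug}}$: indirectly getting VC
guarantee without confining to $\mathcal{H}(C)$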


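Another View of Augmented Error

Augmented Error:
$E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$

VC Bound:
$E_{\mathrm{out}}(\mathbf{w}) \le E_{\mathrm{in}}(\mathbf{w}) + \Omega(\mathcal{H})$

• regularizer $\mathbf{w}^T \mathbf{w} = \Omega(\mathbf{w})$: complexity of a single hypothesis
• generalization price $\Omega(\mathcal{H})$: complexity of a hypothesis set
• if $\frac{\lambda}{N} \Omega(\mathbf{w})$ 'represents' $\Omega(\mathcal{H})$ well, $E_{\mathrm{aug}}$ is a better proxy of $E_{\mathrm{out}}$ than $E_{\mathrm{in}}$

minimizing $E_{\mathrm{aug}}$:
(heuristically) operating with the better proxy;
(technically) enjoying flexibility of whole $\mathcal{H}$.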


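Effective VC Dimension

$\min_{\mathbf{w} \in \mathbb{R}^{\tilde{d}+1}} E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N} \Omega(\mathbf{w})$

• model complexity?
$d_{\mathrm{VC}}(\mathcal{H}) = \tilde{d} + 1$, because $\{\mathbf{w}\}$ 'all considered' during minimization
• $\{\mathbf{w}\}$ 'actually needed': $\mathcal{H}(C)$, with some $C$ equivalent to $\lambda$
• $d_{\mathrm{VC}}(\mathcal{H}(C))$: effective VC dimension $d_{\mathrm{EFF}}(\mathcal{H}, \mathcal{A})$, where $\mathcal{A}$ is $\min_{\mathbf{w}} E_{\mathrm{aug}}$

explanation of regularization:
$d_{\mathrm{VC}}(\mathcal{H})$ large,
while $d_{\mathrm{EFF}}(\mathcal{H}, \mathcal{A})$ small if $\mathcal{A}$ regularized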
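The phrase 'some C equivalent to λ' can be illustrated for weight decay by recording the squared length of the regularized weights at each value of the parameter. The sketch assumes NumPy, toy data and an arbitrary list `lams`; it shows how the region of weight space actually used shrinks, and it does not compute a VC dimension.

```python
import numpy as np

def ridge_regression(Z, y, lam):
    """Return (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
N, d = 40, 5
Z = rng.standard_normal((N, d))
y = Z @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + 0.3 * rng.standard_normal(N)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
# For each lambda, the constraint level C that the solution sits on: C = ||w_REG||^2.
equivalent_C = [float(w @ w) for w in (ridge_regression(Z, y, lam) for lam in lams)]
```

Each entry of `equivalent_C` is the radius squared of the hypersphere that the corresponding solution lies on, so a decreasing list means a smaller set of weights is effectively in play.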


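Fun Time

Consider the weight-decay regularization with regression. When increasing $\lambda$ in $\mathcal{A}$, what would happen with $d_{\mathrm{EFF}}(\mathcal{H}, \mathcal{A})$?

(1) $d_{\mathrm{EFF}} \uparrow$
(2) $d_{\mathrm{EFF}} \downarrow$
(3) $d_{\mathrm{EFF}} = d_{\mathrm{VC}}(\mathcal{H})$ and does not depend on $\lambda$
(4) $d_{\mathrm{EFF}} = 1126$ and does not depend on $\lambda$

Reference Answer: (2)

larger $\lambda$
⟺ smaller $C$
⟺ smaller $\mathcal{H}(C)$
⟺ smaller $d_{\mathrm{EFF}}$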
Regularization General Regularizers


General Regularizers $\Omega(\mathbf{w})$

want: constraint in the 'direction' of target function

• target-dependent: some properties of target, if known
  • symmetry regularizer: $\sum [\![ q \text{ is odd} ]\!] \, w_q^2$
• plausible: direction towards smoother or simpler
  (stochastic/deterministic noise both non-smooth)
  • sparsity (L1) regularizer: $\sum |w_q|$ (next slide)
• friendly: easy to optimize
  • weight-decay (L2) regularizer: $\sum w_q^2$
• bad? :-): no worries, guard by $\lambda$

augmented error = error $\mathrm{err}$ + regularizer $\Omega$
regularizer: target-dependent, plausible, or friendly
ringing a bell? :-)
error measure: user-dependent, plausible, or friendly
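The three regularizers named above are one-liners in code. A sketch assuming NumPy, with hypothetical function names and an arbitrary example weight vector indexed from $q = 0$.

```python
import numpy as np

def symmetry_regularizer(w):
    """Sum of w_q^2 over odd q: penalizes odd-order terms only."""
    q = np.arange(len(w))
    return float(np.sum((q % 2 == 1) * w ** 2))

def l1_regularizer(w):
    """Sparsity (L1) regularizer: sum of |w_q|."""
    return float(np.sum(np.abs(w)))

def l2_regularizer(w):
    """Weight-decay (L2) regularizer: sum of w_q^2."""
    return float(w @ w)

w = np.array([1.0, -2.0, 3.0, 0.5])     # arbitrary example, q = 0..3

omega_symmetry = symmetry_regularizer(w)   # (-2)^2 + 0.5^2
omega_l1 = l1_regularizer(w)               # 1 + 2 + 3 + 0.5
omega_l2 = l2_regularizer(w)               # 1 + 4 + 9 + 0.25
```

Any of these can be plugged into the augmented error in place of the weight-decay term; only the L2 case keeps the closed-form solution derived earlier in this lecture.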


L2 and L1 Regularizer

[Figure: two panels, each with a contour $E_{\mathrm{in}} = \text{const.}$]

L2 Regularizer
$\Omega(\mathbf{w}) = \sum_{q=0}^{Q} w_q^2 = \|\mathbf{w}\|_2^2$
• convex, differentiable everywhere
• easy to optimize

L1 Regularizer
$\Omega(\mathbf{w}) = \sum_{q=0}^{Q} |w_q| = \|\mathbf{w}\|_1$
• convex, not differentiable everywhere
• sparsity in solution

L1 useful if needing sparse solution
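The sparsity claim can be seen on a deliberately simple toy problem. The slides do not say how to optimize the L1 version; the sketch below swaps in proximal gradient descent (ISTA) with soft-thresholding, a standard method chosen here as an assumption. The identity design matrix, the targets and `lam` are also assumptions, picked so the result can be followed by hand.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v towards zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_regression(Z, y, lam, n_iters=500):
    """Minimize (1/N)||Zw - y||^2 + (lam/N)||w||_1 by proximal gradient (ISTA).

    Works on N times the objective, ||Zw - y||^2 + lam ||w||_1, which has the
    same minimizer.
    """
    d = Z.shape[1]
    # Step size 1/L, where L = 2 * sigma_max(Z)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(Z, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = 2.0 * Z.T @ (Z @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

def l2_regularized_regression(Z, y, lam):
    """Weight-decay solution (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy problem (an assumption): identity features, two large and two small targets.
Z = np.eye(4)
y = np.array([3.0, -0.2, 0.1, -2.0])
lam = 1.0

w_l1 = l1_regularized_regression(Z, y, lam)
w_l2 = l2_regularized_regression(Z, y, lam)
```

On this problem L1 sets the two small weights exactly to zero and shifts the large ones by a constant, while L2 halves every weight and leaves none at zero.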


The Optimal $\lambda$

[Figure: two panels, stochastic noise and deterministic noise; each plots Expected $E_{\mathrm{out}}$ against Regularization Parameter, $\lambda$]

• more noise ⟺ more regularization needed
  (more bumpy road ⟺ putting brakes more)
• noise unknown: important to make proper choices

how to choose?
stay tuned for the next lecture! :-)


Fun Time

Consider using a regularizer $\Omega(\mathbf{w}) = \sum_{q=0}^{Q} 2^q w_q^2$ to work with Legendre polynomial regression. Which kind of hypothesis does the regularizer prefer?

(1) symmetric polynomials satisfying $h(x) = h(-x)$
(2) low-dimensional polynomials
(3) high-dimensional polynomials
(4) no specific preference

Reference Answer: (2)

There is a higher 'penalty' for higher-order terms, and hence the regularizer prefers low-dimensional polynomials.
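The higher penalty on higher-order terms is easy to see in code. The sketch assumes NumPy, toy data and an arbitrary `lam`; the helper names are hypothetical. The solver follows from setting the gradient of the augmented error to zero, which gives $(\mathrm{Z}^T\mathrm{Z} + \lambda \mathrm{D})\mathbf{w} = \mathrm{Z}^T\mathbf{y}$ with $\mathrm{D} = \mathrm{diag}(2^0, 2^1, \ldots, 2^Q)$.

```python
import numpy as np
from numpy.polynomial import legendre

def weighted_regularizer(w):
    """Omega(w) = sum_q 2^q w_q^2."""
    q = np.arange(len(w))
    return float(np.sum(2.0 ** q * w ** 2))

def weighted_ridge(Z, y, lam):
    """Minimize (1/N)||Zw - y||^2 + (lam/N) sum_q 2^q w_q^2."""
    D = np.diag(2.0 ** np.arange(Z.shape[1]))
    return np.linalg.solve(Z.T @ Z + lam * D, Z.T @ y)

# Toy data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
Q, N = 5, 30
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)
Z = legendre.legvander(x, Q)            # Legendre polynomial features
lam = 1.0

w = weighted_ridge(Z, y, lam)

# The same unit weight costs more on a higher-order term.
penalty_lowest = weighted_regularizer(np.eye(Q + 1)[0])
penalty_highest = weighted_regularizer(np.eye(Q + 1)[Q])
```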
Summary
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting

Lecture 14: Regularization
• Regularized Hypothesis Set
  original $\mathcal{H}$ + constraint
• Weight Decay Regularization
  add $\frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$ in $E_{\mathrm{aug}}$
• Regularization and VC Theory
  regularization decreases $d_{\mathrm{EFF}}$
• General Regularizers
  target-dependent, plausible, or friendly

• next: choosing from the so-many models/parameters